(spark) branch master updated: [SPARK-46437][FOLLOWUP] Update configuration.md to use include_api_gen

2024-01-09 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new fcdfc8cbb24e [SPARK-46437][FOLLOWUP] Update configuration.md to use 
include_api_gen
fcdfc8cbb24e is described below

commit fcdfc8cbb24e7d1f96c8c3c684ef476576797e17
Author: Nicholas Chammas 
AuthorDate: Wed Jan 10 16:30:47 2024 +0900

[SPARK-46437][FOLLOWUP] Update configuration.md to use include_api_gen

### What changes were proposed in this pull request?

As part of #44630 I neglected to update some places that still use the 
following Liquid directive pattern:

```liquid
{% for static_file in site.static_files %}
{% if static_file.name == 'generated-agg-funcs-table.html' %}
{% include_relative generated-agg-funcs-table.html %}
{% break %}
{% endif %}
{% endfor %}
```

This PR replaces all remaining instances of this pattern with the new 
`include_api_gen` Jekyll tag.

### Why are the changes needed?

For consistency in how we build our docs.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manually building and reviewing the configuration docs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44663 from nchammas/configuration-include-api-gen.

Authored-by: Nicholas Chammas 
Signed-off-by: Hyukjin Kwon 
---
 docs/configuration.md | 17 ++---
 1 file changed, 2 insertions(+), 15 deletions(-)

diff --git a/docs/configuration.md b/docs/configuration.md
index b45d647fde85..beb52c62d6c2 100644
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -3302,9 +3302,6 @@ Spark subsystems.
 
 ### Spark SQL
 
-{% for static_file in site.static_files %}
-{% if static_file.name == 'generated-runtime-sql-config-table.html' %}
-
 #### Runtime SQL Configuration
 
 Runtime SQL configurations are per-session, mutable Spark SQL configurations. 
They can be set with initial values by the config file
@@ -3312,13 +3309,7 @@ and command-line options with `--conf/-c` prefixed, or 
by setting `SparkConf` th
 Also, they can be set and queried by SET commands and rest to their initial 
values by RESET command,
 or by `SparkSession.conf`'s setter and getter methods in runtime.
 
-{% include_relative generated-runtime-sql-config-table.html %}
-{% break %}
-{% endif %}
-{% endfor %}
-
-{% for static_file in site.static_files %}
-{% if static_file.name == 'generated-static-sql-config-table.html' %}
+{% include_api_gen generated-runtime-sql-config-table.html %}
 
 #### Static SQL Configuration
 
@@ -3326,11 +3317,7 @@ Static SQL configurations are cross-session, immutable 
Spark SQL configurations.
 and command-line options with `--conf/-c` prefixed, or by setting `SparkConf` 
that are used to create `SparkSession`.
 External users can query the static sql config values via `SparkSession.conf` 
or via set command, e.g. `SET spark.sql.extensions;`, but cannot set/unset them.
 
-{% include_relative generated-static-sql-config-table.html %}
-{% break %}
-{% endif %}
-{% endfor %}
-
+{% include_api_gen generated-static-sql-config-table.html %}
 
 ### Spark Streaming
 





(spark) branch master updated (a3991b17e379 -> 63758177e9c6)

2024-01-09 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from a3991b17e379 [SPARK-46648][SQL] Use `zstd` as the default ORC 
compression
 add 63758177e9c6 [SPARK-46651][PS][TESTS] Split `FrameTakeTests`

No new revisions were added by this update.

Summary of changes:
 dev/sparktestsupport/modules.py|  2 +
 .../pandas/tests/connect/frame/test_parity_take.py | 11 ++--
 .../test_parity_take_adv.py}   |  8 +--
 python/pyspark/pandas/tests/frame/test_take.py | 61 +++---
 .../tests/frame/{test_take.py => test_take_adv.py} | 56 
 5 files changed, 27 insertions(+), 111 deletions(-)
 copy python/pyspark/pandas/tests/connect/{indexes/test_parity_append.py => 
frame/test_parity_take_adv.py} (87%)
 copy python/pyspark/pandas/tests/frame/{test_take.py => test_take_adv.py} (64%)





(spark) branch master updated: [SPARK-46648][SQL] Use `zstd` as the default ORC compression

2024-01-09 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new a3991b17e379 [SPARK-46648][SQL] Use `zstd` as the default ORC 
compression
a3991b17e379 is described below

commit a3991b17e3790aded41dea1160b50ac605275c81
Author: Dongjoon Hyun 
AuthorDate: Tue Jan 9 21:10:38 2024 -0800

[SPARK-46648][SQL] Use `zstd` as the default ORC compression

### What changes were proposed in this pull request?

This PR aims to use `zstd` as the default ORC compression.

Note that Apache ORC v2.0 also uses `zstd` as the default compression via 
[ORC-1577](https://issues.apache.org/jira/browse/ORC-1577).

The following presentation discusses the usage of ZStandard:
- _The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro_
- [Slides](https://www.slideshare.net/databricks/the-rise-of-zstandard-apache-sparkparquetorcavro)
- [Youtube](https://youtu.be/dTGxhHwjONY)
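
For context, the codec remains user-configurable after this change. A minimal PySpark sketch (assuming an existing SparkSession and a writable local path) of setting it explicitly, either per session or per write:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Session-wide ORC codec; this PR only changes the default value of this config.
spark.conf.set("spark.sql.orc.compression.codec", "snappy")

# Per-write option, which takes precedence over the session config.
spark.range(10).write.mode("overwrite").option("compression", "zstd").orc("/tmp/orc_zstd_demo")
```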

### Why are the changes needed?

In general, `ZStandard` is better in terms of the file size.
```
$ aws s3 ls s3://dongjoon/orc2/tpcds-sf-10-orc-snappy/ --recursive --summarize --human-readable | tail -n1
   Total Size: 2.8 GiB

$ aws s3 ls s3://dongjoon/orc2/tpcds-sf-10-orc-zstd/ --recursive --summarize --human-readable | tail -n1
   Total Size: 2.4 GiB
```

As a result, the performance is also generally better on cloud storage.

```
$ JDK_JAVA_OPTIONS='-Dspark.sql.sources.default=orc' \
build/sbt "sql/Test/runMain org.apache.spark.sql.execution.benchmark.TPCDSQueryBenchmark --data-location s3a://dongjoon/orc2/tpcds-sf-1-orc-snappy"
...
[info] Running benchmark: TPCDS Snappy
[info]   Running case: q1
[info]   Stopped after 2 iterations, 5712 ms
[info] OpenJDK 64-Bit Server VM 17.0.9+9-LTS on Mac OS X 14.3
[info] Apple M1 Max
[info] TPCDS Snappy:                     Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ----------------------------------------------------------------------------------------------------------------
[info] q1                                         2708            2856         210          0.2        5869.3       1.0X
[info] Running benchmark: TPCDS Snappy
[info]   Running case: q2
[info]   Stopped after 2 iterations, 7006 ms
[info] OpenJDK 64-Bit Server VM 17.0.9+9-LTS on Mac OS X 14.3
[info] Apple M1 Max
[info] TPCDS Snappy:                     Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ----------------------------------------------------------------------------------------------------------------
[info] q2                                         3424            3503         113          0.7        1533.9       1.0X
[info] Running benchmark: TPCDS Snappy
[info]   Running case: q3
[info]   Stopped after 2 iterations, 6577 ms
[info] OpenJDK 64-Bit Server VM 17.0.9+9-LTS on Mac OS X 14.3
[info] Apple M1 Max
[info] TPCDS Snappy:                     Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ----------------------------------------------------------------------------------------------------------------
[info] q3                                         3146            3289         202          0.9        1059.0       1.0X
[info] Running benchmark: TPCDS Snappy
[info]   Running case: q4
[info]   Stopped after 2 iterations, 36228 ms
[info] OpenJDK 64-Bit Server VM 17.0.9+9-LTS on Mac OS X 14.3
[info] Apple M1 Max
[info] TPCDS Snappy:                     Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ----------------------------------------------------------------------------------------------------------------
[info] q4                                        17592           18114         738          0.3        3375.5       1.0X
...
```

```
$ JDK_JAVA_OPTIONS='-Dspark.sql.sources.default=orc' \
build/sbt "sql/Test/runMain org.apache.spark.sql.execution.benchmark.TPCDSQueryBenchmark --data-location s3a://dongjoon/orc2/tpcds-sf-1-orc-zstd"
[info] Running benchmark: TPCDS Snappy
[info]   Running case: q1
[info]   Stopped after 2 iterations, 5235 ms
[info] OpenJDK 64-Bit Server VM 17.0.9+9-LTS on Mac OS X 14.3
[info] Apple M1 Max
[info] TPCDS Snappy:                     Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ----------------------------------------------------------------------------------------------------------------

(spark) branch master updated: [SPARK-46442][SQL] DS V2 supports push down PERCENTILE_CONT and PERCENTILE_DISC

2024-01-09 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 85b504d64701 [SPARK-46442][SQL] DS V2 supports push down 
PERCENTILE_CONT and PERCENTILE_DISC
85b504d64701 is described below

commit 85b504d64701ca470b946841ca5b2b4e129293c1
Author: Jiaan Geng 
AuthorDate: Wed Jan 10 12:24:24 2024 +0800

[SPARK-46442][SQL] DS V2 supports push down PERCENTILE_CONT and 
PERCENTILE_DISC

### What changes were proposed in this pull request?
This PR will translate the aggregate function `PERCENTILE_CONT` and 
`PERCENTILE_DISC` for pushdown.

- This PR adds `Expression[] orderingWithinGroups` to `GeneralAggregateFunc`, so that the DS V2 pushdown framework can compile the `WITHIN GROUP (ORDER BY ...)` clause easily.

- This PR also splits `visitInverseDistributionFunction` out of `visitAggregateFunction`, so that the DS V2 pushdown framework can generate the `WITHIN GROUP (ORDER BY ...)` syntax easily.

- This PR also fixes a bug where `JdbcUtils` could not handle the precision and scale of decimals returned from JDBC.

### Why are the changes needed?
DS V2 supports push down `PERCENTILE_CONT` and `PERCENTILE_DISC`.
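
For illustration, a minimal PySpark sketch of the aggregate shape involved; the table and column names are made up, and the actual pushdown only happens when the source is a DS V2 JDBC relation:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.range(100).selectExpr("id % 3 AS dept", "id AS salary").createOrReplaceTempView("emp")

# PERCENTILE_CONT / PERCENTILE_DISC with WITHIN GROUP (ORDER BY ...), the syntax
# that the DS V2 framework can now translate and push down.
spark.sql("""
    SELECT dept,
           percentile_cont(0.5) WITHIN GROUP (ORDER BY salary) AS median_cont,
           percentile_disc(0.5) WITHIN GROUP (ORDER BY salary) AS median_disc
    FROM emp
    GROUP BY dept
""").show()
```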

### Does this PR introduce _any_ user-facing change?
'No'.
New feature.

### How was this patch tested?
New test cases.

### Was this patch authored or co-authored using generative AI tooling?
'No'.

Closes #44397 from beliefer/SPARK-46442.

Lead-authored-by: Jiaan Geng 
Co-authored-by: beliefer 
Signed-off-by: Wenchen Fan 
---
 .../aggregate/GeneralAggregateFunc.java| 21 +++-
 .../sql/connector/util/V2ExpressionSQLBuilder.java | 21 +++-
 .../sql/catalyst/util/V2ExpressionBuilder.scala| 20 +--
 .../org/apache/spark/sql/jdbc/H2Dialect.scala  | 15 +-
 .../org/apache/spark/sql/jdbc/JdbcDialects.scala   | 17 +-
 .../org/apache/spark/sql/jdbc/JDBCV2Suite.scala| 62 --
 6 files changed, 132 insertions(+), 24 deletions(-)

diff --git 
a/sql/catalyst/src/main/java/org/apache/spark/sql/connector/expressions/aggregate/GeneralAggregateFunc.java
 
b/sql/catalyst/src/main/java/org/apache/spark/sql/connector/expressions/aggregate/GeneralAggregateFunc.java
index 4d787eaf9644..d287288ba33f 100644
--- 
a/sql/catalyst/src/main/java/org/apache/spark/sql/connector/expressions/aggregate/GeneralAggregateFunc.java
+++ 
b/sql/catalyst/src/main/java/org/apache/spark/sql/connector/expressions/aggregate/GeneralAggregateFunc.java
@@ -21,6 +21,7 @@ import java.util.Arrays;
 
 import org.apache.spark.annotation.Evolving;
 import org.apache.spark.sql.connector.expressions.Expression;
+import org.apache.spark.sql.connector.expressions.SortValue;
 import org.apache.spark.sql.internal.connector.ExpressionWithToString;
 
 /**
@@ -41,7 +42,9 @@ import 
org.apache.spark.sql.internal.connector.ExpressionWithToString;
  *  REGR_R2(input1, input2) Since 3.4.0
  *  REGR_SLOPE(input1, input2) Since 3.4.0
  *  REGR_SXY(input1, input2) Since 3.4.0
- *  MODE(input1[, inverse]) Since 4.0.0
+ *  MODE() WITHIN (ORDER BY input1 [ASC|DESC]) Since 4.0.0
+ *  PERCENTILE_CONT(input1) WITHIN (ORDER BY input2 [ASC|DESC]) 
Since 4.0.0
+ *  PERCENTILE_DISC(input1) WITHIN (ORDER BY input2 [ASC|DESC]) 
Since 4.0.0
  * 
  *
  * @since 3.3.0
@@ -51,11 +54,21 @@ public final class GeneralAggregateFunc extends 
ExpressionWithToString implement
   private final String name;
   private final boolean isDistinct;
   private final Expression[] children;
+  private final SortValue[] orderingWithinGroups;
 
   public GeneralAggregateFunc(String name, boolean isDistinct, Expression[] 
children) {
 this.name = name;
 this.isDistinct = isDistinct;
 this.children = children;
+this.orderingWithinGroups = new SortValue[]{};
+  }
+
+  public GeneralAggregateFunc(
+  String name, boolean isDistinct, Expression[] children, SortValue[] 
orderingWithinGroups) {
+this.name = name;
+this.isDistinct = isDistinct;
+this.children = children;
+this.orderingWithinGroups = orderingWithinGroups;
   }
 
   public String name() { return name; }
@@ -64,6 +77,8 @@ public final class GeneralAggregateFunc extends 
ExpressionWithToString implement
   @Override
   public Expression[] children() { return children; }
 
+  public SortValue[] orderingWithinGroups() { return orderingWithinGroups; }
+
   @Override
   public boolean equals(Object o) {
 if (this == o) return true;
@@ -73,7 +88,8 @@ public final class GeneralAggregateFunc extends 
ExpressionWithToString implement
 
 if (isDistinct != that.isDistinct) return false;
 if (!name.equals(that.name)) return false;
-return Arrays.equals(children, that.children);
+if (!Arrays.equals(children, that.children)) return false;
+return 

(spark) branch master updated: [MINOR][DOCS] Correct the usage example of Dataset in Java

2024-01-09 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new d2f572428be5 [MINOR][DOCS] Correct the usage example of Dataset in Java
d2f572428be5 is described below

commit d2f572428be5346dfa412f6588e72686429ddc71
Author: aiden 
AuthorDate: Wed Jan 10 12:57:40 2024 +0900

[MINOR][DOCS] Correct the usage example of Dataset in Java

### What changes were proposed in this pull request?

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

### Was this patch authored or co-authored using generative AI tooling?

Closes #44342 from Aiden-Dong/aiden-dev.

Lead-authored-by: aiden 
Co-authored-by: Hyukjin Kwon 
Signed-off-by: Hyukjin Kwon 
---
 sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala 
b/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
index 31e1495db7e3..ff1bd8c73e6f 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
@@ -138,7 +138,8 @@ private[sql] object Dataset {
  * the following creates a new Dataset by applying a filter on the existing 
one:
  * {{{
  *   val names = people.map(_.name)  // in Scala; names is a Dataset[String]
- *   Dataset<String> names = people.map((Person p) -> p.name, Encoders.STRING));
+ *   Dataset<String> names = people.map(
+ *     (MapFunction<Person, String>) p -> p.name, Encoders.STRING()); // Java
  * }}}
  *
  * Dataset operations can also be untyped, through various 
domain-specific-language (DSL)





(spark) branch master updated: [SPARK-46649][PYTHON][INFRA] Run PyPy 3 and Python 3.10 tests independently

2024-01-09 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new f526bea1dc0c [SPARK-46649][PYTHON][INFRA] Run PyPy 3 and Python 3.10 
tests independently
f526bea1dc0c is described below

commit f526bea1dc0c2d744566f73212000e205f4ecec9
Author: Hyukjin Kwon 
AuthorDate: Wed Jan 10 12:26:56 2024 +0900

[SPARK-46649][PYTHON][INFRA] Run PyPy 3 and Python 3.10 tests independently

### What changes were proposed in this pull request?

This PR proposes to split PyPy 3 and Python 3.10 builds

### Why are the changes needed?

https://github.com/apache/spark/actions/runs/7462843546/job/20306241275

It seems the job terminates in the middle because of OOM, so we should split the build.

### Does this PR introduce _any_ user-facing change?

No, dev-only

### How was this patch tested?

CI should verify the change.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44655 from HyukjinKwon/SPARK-46649.

Authored-by: Hyukjin Kwon 
Signed-off-by: Hyukjin Kwon 
---
 .github/workflows/build_python.yml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/.github/workflows/build_python.yml 
b/.github/workflows/build_python.yml
index ebd8de2311c6..a2cf7c64f089 100644
--- a/.github/workflows/build_python.yml
+++ b/.github/workflows/build_python.yml
@@ -27,7 +27,7 @@ jobs:
   run-build:
 strategy:
   matrix:
-pyversion: ["pypy3,python3.10", "python3.11", "python3.12"]
+pyversion: ["pypy3", "python3.10", "python3.11", "python3.12"]
 permissions:
   packages: write
 name: Run





(spark) branch master updated (0791e9f302fb -> 4957c1a5fd42)

2024-01-09 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 0791e9f302fb [SPARK-46536][SQL] Support GROUP BY calendar_interval_type
 add 4957c1a5fd42 [SPARK-46645][INFRA] Exclude unittest-xml-reporting in 
Python 3.12 image

No new revisions were added by this update.

Summary of changes:
 dev/infra/Dockerfile | 9 +
 1 file changed, 5 insertions(+), 4 deletions(-)





(spark) branch master updated: [SPARK-46536][SQL] Support GROUP BY calendar_interval_type

2024-01-09 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 0791e9f302fb [SPARK-46536][SQL] Support GROUP BY calendar_interval_type
0791e9f302fb is described below

commit 0791e9f302fba547fb1e6d8386bf7b55e26aa22e
Author: Stefan Kandic 
AuthorDate: Wed Jan 10 12:09:37 2024 +0900

[SPARK-46536][SQL] Support GROUP BY calendar_interval_type

### What changes were proposed in this pull request?

Allow group by on columns of type CalendarInterval

### Why are the changes needed?

Currently, Spark GROUP BY only allows orderable data types, otherwise the 
plan analysis fails: 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ExprUtils.scala#L197-L203

However, this is too strict as GROUP BY only cares about equality, not 
ordering. The CalendarInterval type is not orderable (1 month and 30 days, we 
don't know which one is larger), but has well-defined equality. In fact, we 
already support `SELECT DISTINCT calendar_interval_type` in some cases (when 
hash aggregate is picked by the planner).

### Does this PR introduce _any_ user-facing change?

Yes, users will now be able to do group by on columns of type 
CalendarInterval
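
For example, a minimal PySpark sketch of the newly allowed pattern (requires a build that includes this change; `make_interval` produces a CalendarInterval column):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# make_interval(years, months, weeks, days, hours, mins, secs) returns CalendarIntervalType.
df = spark.sql("""
    SELECT make_interval(0, 1, 0, 0, 0, 0, 0) AS i
    UNION ALL SELECT make_interval(0, 0, 0, 30, 0, 0, 0)
    UNION ALL SELECT make_interval(0, 1, 0, 0, 0, 0, 0)
""")

# Before this change the analysis failed because CalendarInterval is not orderable.
df.groupBy("i").count().show()
```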

### How was this patch tested?

By adding new UTs

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #44538 from stefankandic/SPARK-46536-groupby-calendarInterval.

Lead-authored-by: Stefan Kandic 
Co-authored-by: Hyukjin Kwon 
Signed-off-by: Hyukjin Kwon 
---
 .../spark/unsafe/types/CalendarInterval.java   | 24 +++-
 .../spark/unsafe/types/CalendarIntervalSuite.java  | 16 +++
 .../spark/sql/catalyst/expressions/ExprUtils.scala |  2 +-
 .../expressions/codegen/CodeGenerator.scala|  2 ++
 .../spark/sql/execution/aggregate/AggUtils.scala   | 15 --
 .../sql/execution/aggregate/HashMapGenerator.scala |  1 +
 .../apache/spark/sql/DataFrameAggregateSuite.scala | 32 ++
 7 files changed, 88 insertions(+), 4 deletions(-)

diff --git 
a/common/unsafe/src/main/java/org/apache/spark/unsafe/types/CalendarInterval.java
 
b/common/unsafe/src/main/java/org/apache/spark/unsafe/types/CalendarInterval.java
index f2d06e793f9d..b567ac302b84 100644
--- 
a/common/unsafe/src/main/java/org/apache/spark/unsafe/types/CalendarInterval.java
+++ 
b/common/unsafe/src/main/java/org/apache/spark/unsafe/types/CalendarInterval.java
@@ -44,7 +44,7 @@ import static 
org.apache.spark.sql.catalyst.util.DateTimeConstants.*;
  * @since 3.0.0
  */
 @Unstable
-public final class CalendarInterval implements Serializable {
+public final class CalendarInterval implements Serializable, Comparable<CalendarInterval> {
   // NOTE: If you're moving or renaming this file, you should also update 
Unidoc configuration
   // specified in 'SparkBuild.scala'.
   public final int months;
@@ -127,4 +127,26 @@ public final class CalendarInterval implements 
Serializable {
* @throws ArithmeticException if a numeric overflow occurs
*/
   public Duration extractAsDuration() { return Duration.of(microseconds, 
ChronoUnit.MICROS); }
+
+  /**
+   * This method is not used to order CalendarInterval instances, as they are 
not orderable and
+   * cannot be used in a ORDER BY statement.
+   * Instead, it is used to find identical interval instances for aggregation 
purposes.
+   * It compares the 'months', 'days', and 'microseconds' fields of this 
CalendarInterval
+   * with another instance. The comparison is done first on the 'months', then 
on the 'days',
+   * and finally on the 'microseconds'.
+   *
+   * @param o The CalendarInterval instance to compare with.
+   * @return Zero if this object is equal to the specified object, and 
non-zero otherwise
+   */
+  @Override
+  public int compareTo(CalendarInterval o) {
+    if (this.months != o.months) {
+      return Integer.compare(this.months, o.months);
+    } else if (this.days != o.days) {
+      return Integer.compare(this.days, o.days);
+    } else {
+      return Long.compare(this.microseconds, o.microseconds);
+    }
+  }
 }
diff --git 
a/common/unsafe/src/test/java/org/apache/spark/unsafe/types/CalendarIntervalSuite.java
 
b/common/unsafe/src/test/java/org/apache/spark/unsafe/types/CalendarIntervalSuite.java
index b8b710523365..0a1ee279316f 100644
--- 
a/common/unsafe/src/test/java/org/apache/spark/unsafe/types/CalendarIntervalSuite.java
+++ 
b/common/unsafe/src/test/java/org/apache/spark/unsafe/types/CalendarIntervalSuite.java
@@ -76,6 +76,22 @@ public class CalendarIntervalSuite {
   i.toString());
   }
 
+  @Test
+  public void compareToTest() {
+   CalendarInterval i = new CalendarInterval(0, 0, 0);
+
+   assertEquals(i.compareTo(new 

(spark) branch master updated: [SPARK-46646][SQL][TESTS] Improve `TPCDSQueryBenchmark` to support other file formats

2024-01-09 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 24cb6117ba4a [SPARK-46646][SQL][TESTS] Improve `TPCDSQueryBenchmark` 
to support other file formats
24cb6117ba4a is described below

commit 24cb6117ba4a477e4e2c82ba1a17d799f13a623c
Author: Dongjoon Hyun 
AuthorDate: Tue Jan 9 18:50:53 2024 -0800

[SPARK-46646][SQL][TESTS] Improve `TPCDSQueryBenchmark` to support other 
file formats

### What changes were proposed in this pull request?

This PR aims to improve `TPCDSQueryBenchmark` to support other file formats.

### Why are the changes needed?

Currently, `parquet` is hard-coded because it's the default value of
`spark.sql.sources.default`.


https://github.com/apache/spark/blob/48d22e9f876f070d35ff3dd011bfbd1b6bccb4ac/sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/TPCDSQueryBenchmark.scala#L77
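
As a small illustration of the mechanism (not the benchmark itself), a PySpark sketch that picks up the session default format instead of hard-coding `parquet`; the output path is a placeholder:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Mirror the benchmark change: take the format from the session config so that
# -Dspark.sql.sources.default=orc (or any other format) is honored.
fmt = spark.conf.get("spark.sql.sources.default")
spark.range(10).write.format(fmt).mode("overwrite").save("/tmp/format_demo")
print(f"wrote demo data using format: {fmt}")
```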

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Manual.

**BEFORE**
```
$ build/sbt "sql/Test/runMain org.apache.spark.sql.execution.benchmark.TPCDSQueryBenchmark --data-location /tmp/tpcds-sf-1-orc-snappy/"
...
[info] 18:36:39.698 ERROR org.apache.spark.executor.Executor: Exception in task 0.0 in stage 0.0 (TID 0)
[info] java.lang.RuntimeException: file:/tmp/tpcds-sf-1-orc-snappy/catalog_page/part-0-40446d2a-f814-4e26-b3e1-664b833bf041-c000.snappy.orc is not a Parquet file. Expected magic number at tail, but found [79, 82, 67, 25]
[info]  at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:565)
...
```

**AFTER**
```
$ JDK_JAVA_OPTIONS='-Dspark.sql.sources.default=orc' \
build/sbt "sql/Test/runMain org.apache.spark.sql.execution.benchmark.TPCDSQueryBenchmark --data-location /tmp/tpcds-sf-1-orc-snappy/"
...
[info] Running benchmark: TPCDS Snappy
[info]   Running case: q1
[info]   Stopped after 6 iterations, 2028 ms
[info] OpenJDK 64-Bit Server VM 17.0.9+9-LTS on Mac OS X 14.3
[info] Apple M1 Max
[info] TPCDS Snappy:                     Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ----------------------------------------------------------------------------------------------------------------
[info] q1                                          305             338          24          1.5         660.4       1.0X
```

### Was this patch authored or co-authored using generative AI tooling?

Closes #44651 from dongjoon-hyun/SPARK-46646.

Authored-by: Dongjoon Hyun 
Signed-off-by: Dongjoon Hyun 
---
 .../org/apache/spark/sql/execution/benchmark/TPCDSQueryBenchmark.scala | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/TPCDSQueryBenchmark.scala
 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/TPCDSQueryBenchmark.scala
index 721997d84e1a..1ff6122906d1 100644
--- 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/TPCDSQueryBenchmark.scala
+++ 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/TPCDSQueryBenchmark.scala
@@ -74,7 +74,8 @@ object TPCDSQueryBenchmark extends SqlBasedBenchmark with 
Logging {
 tables.map { tableName =>
   spark.sql(s"DROP TABLE IF EXISTS $tableName")
   val options = Map("path" -> s"$dataLocation/$tableName")
-      spark.catalog.createTable(tableName, "parquet", tableColumns(tableName), options)
+      val format = spark.conf.get("spark.sql.sources.default")
+      spark.catalog.createTable(tableName, format, tableColumns(tableName), options)
   // Recover partitions but don't fail if a table is not partitioned.
   Try {
 spark.sql(s"ALTER TABLE $tableName RECOVER PARTITIONS")





(spark) branch master updated: [SPARK-46541][SQL][CONNECT] Fix the ambiguous column reference in self join

2024-01-09 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 686f428dc104 [SPARK-46541][SQL][CONNECT] Fix the ambiguous column 
reference in self join
686f428dc104 is described below

commit 686f428dc10410e95d4421d4cbe0dd509335c9f2
Author: Ruifeng Zheng 
AuthorDate: Wed Jan 10 10:38:00 2024 +0800

[SPARK-46541][SQL][CONNECT] Fix the ambiguous column reference in self join

### What changes were proposed in this pull request?
Fix the logic of ambiguous column detection in Spark Connect.

### Why are the changes needed?
```
In [24]: df1 = spark.range(10).withColumn("a", sf.lit(0))

In [25]: df2 = df1.withColumnRenamed("a", "b")

In [26]: df1.join(df2, df1["a"] == df2["b"])
Out[26]: 23/12/22 09:33:28 ERROR ErrorUtils: Spark Connect RPC error during: analyze. UserId: ruifeng.zheng. SessionId: eaa2161f-4b64-4dbf-9809-af6b696d3005.
org.apache.spark.sql.AnalysisException: [AMBIGUOUS_COLUMN_REFERENCE] Column a is ambiguous. It's because you joined several DataFrame together, and some of these DataFrames are the same.
This column points to one of the DataFrame but Spark is unable to figure out which one.
Please alias the DataFrames with different names via DataFrame.alias before joining them,
and specify the column using qualified name, e.g. df.alias("a").join(df.alias("b"), col("a.id") > col("b.id")). SQLSTATE: 42702
        at org.apache.spark.sql.catalyst.analysis.ColumnResolutionHelper.findPlanById(ColumnResolutionHelper.scala:555)
        at

```
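
For reference, a minimal PySpark sketch of the aliasing approach recommended by the error message above; it disambiguates the self join with or without this fix:

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as sf

spark = SparkSession.builder.getOrCreate()

df1 = spark.range(10).withColumn("a", sf.lit(0))
df2 = df1.withColumnRenamed("a", "b")

# Qualified names via aliases leave no ambiguity for the analyzer.
df1.alias("l").join(df2.alias("r"), sf.col("l.a") == sf.col("r.b")).show()
```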

### Does this PR introduce _any_ user-facing change?
yes, fix a bug

### How was this patch tested?
added ut

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #44532 from zhengruifeng/sql_connect_find_plan_id.

Lead-authored-by: Ruifeng Zheng 
Co-authored-by: Ruifeng Zheng 
Signed-off-by: Ruifeng Zheng 
---
 .../src/main/resources/error/error-classes.json|  11 +-
 .../org/apache/spark/sql/ClientE2ETestSuite.scala  |   2 +-
 docs/sql-error-conditions.md   |   6 +
 .../sql/tests/connect/test_connect_basic.py|  13 +-
 python/pyspark/sql/tests/test_dataframe.py |   9 +-
 .../catalyst/analysis/ColumnResolutionHelper.scala | 139 -
 .../spark/sql/errors/QueryCompilationErrors.scala  |  18 ++-
 7 files changed, 133 insertions(+), 65 deletions(-)

diff --git a/common/utils/src/main/resources/error/error-classes.json 
b/common/utils/src/main/resources/error/error-classes.json
index c7f8f59a7679..e770b9c7053e 100644
--- a/common/utils/src/main/resources/error/error-classes.json
+++ b/common/utils/src/main/resources/error/error-classes.json
@@ -324,6 +324,12 @@
 ],
 "sqlState" : "0AKD0"
   },
+  "CANNOT_RESOLVE_DATAFRAME_COLUMN" : {
+"message" : [
+  "Cannot resolve dataframe column . It's probably because of 
illegal references like `df1.select(df2.col(\"a\"))`."
+],
+"sqlState" : "42704"
+  },
   "CANNOT_RESOLVE_STAR_EXPAND" : {
 "message" : [
   "Cannot resolve .* given input columns . Please 
check that the specified table or struct exists and is accessible in the input 
columns."
@@ -6843,11 +6849,6 @@
   "Cannot modify the value of a static config: "
 ]
   },
-  "_LEGACY_ERROR_TEMP_3051" : {
-"message" : [
-  "When resolving , fail to find subplan with plan_id= in "
-]
-  },
   "_LEGACY_ERROR_TEMP_3052" : {
 "message" : [
   "Unexpected resolved action: "
diff --git 
a/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/ClientE2ETestSuite.scala
 
b/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/ClientE2ETestSuite.scala
index 0740334724e8..288964a084ba 100644
--- 
a/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/ClientE2ETestSuite.scala
+++ 
b/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/ClientE2ETestSuite.scala
@@ -894,7 +894,7 @@ class ClientE2ETestSuite extends RemoteSparkSession with 
SQLHelper with PrivateM
   // df1("i") is not ambiguous, but it's not valid in the projected df.
   df1.select((df1("i") + 1).as("plus")).select(df1("i")).collect()
 }
-    assert(e1.getMessage.contains("MISSING_ATTRIBUTES.RESOLVED_ATTRIBUTE_MISSING_FROM_INPUT"))
+    assert(e1.getMessage.contains("UNRESOLVED_COLUMN.WITH_SUGGESTION"))
 
 checkSameResult(
   Seq(Row(1, "a")),
diff --git a/docs/sql-error-conditions.md b/docs/sql-error-conditions.md
index f58b7f607a0b..db8ecf5b2a30 100644
--- a/docs/sql-error-conditions.md
+++ b/docs/sql-error-conditions.md
@@ -282,6 +282,12 @@ Cannot recognize hive type string: ``, column: 
``. The spe
 
 Renaming a `` across schemas is not allowed.
 
+### 

(spark) branch master updated: [SPARK-46643][SQL][TESTS] Fix ORC tests to be independent from default compression

2024-01-09 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 48d22e9f876 [SPARK-46643][SQL][TESTS] Fix ORC tests to be independent 
from default compression
48d22e9f876 is described below

commit 48d22e9f876f070d35ff3dd011bfbd1b6bccb4ac
Author: Dongjoon Hyun 
AuthorDate: Tue Jan 9 18:00:22 2024 -0800

[SPARK-46643][SQL][TESTS] Fix ORC tests to be independent from default 
compression

### What changes were proposed in this pull request?

This PR aims to fix ORC tests to be independent from the change of default 
ORC compression.

### Why are the changes needed?

Currently, a few test cases make an implicit assumption about the default ORC compression codec.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44648 from dongjoon-hyun/SPARK-46643.

Authored-by: Dongjoon Hyun 
Signed-off-by: Dongjoon Hyun 
---
 .../spark/sql/execution/datasources/orc/OrcQuerySuite.scala|  2 +-
 .../spark/sql/execution/datasources/orc/OrcSourceSuite.scala   |  5 +++--
 .../apache/spark/sql/hive/orc/OrcHadoopFsRelationSuite.scala   | 10 --
 3 files changed, 4 insertions(+), 13 deletions(-)

diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcQuerySuite.scala
 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcQuerySuite.scala
index 7d666729bb4..3f3776bab8f 100644
--- 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcQuerySuite.scala
+++ 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcQuerySuite.scala
@@ -508,7 +508,7 @@ abstract class OrcQueryTest extends OrcTest {
   conf.setBoolean("hive.io.file.read.all.columns", false)
 
   val orcRecordReader = {
-        val file = new File(path).listFiles().find(_.getName.endsWith(".snappy.orc")).head
+        val file = new File(path).listFiles().find(_.getName.endsWith(".orc")).head
 val split = new FileSplit(new Path(file.toURI), 0, file.length, 
Array.empty[String])
 val attemptId = new TaskAttemptID(new TaskID(new JobID(), 
TaskType.MAP, 0), 0)
 val hadoopAttemptContext = new TaskAttemptContextImpl(conf, attemptId)
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcSourceSuite.scala
 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcSourceSuite.scala
index 1e98099361d..6166773fb09 100644
--- 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcSourceSuite.scala
+++ 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcSourceSuite.scala
@@ -332,8 +332,9 @@ abstract class OrcSuite
 
   test("SPARK-21839: Add SQL config for ORC compression") {
 val conf = spark.sessionState.conf
-    // Test if the default of spark.sql.orc.compression.codec is snappy
-    assert(new OrcOptions(Map.empty[String, String], conf).compressionCodec == SNAPPY.name())
+    // Test if the default of spark.sql.orc.compression.codec is used.
+    assert(new OrcOptions(Map.empty[String, String], conf).compressionCodec ==
+      SQLConf.ORC_COMPRESSION.defaultValueString.toUpperCase(Locale.ROOT))
 
 // OrcOptions's parameters have a higher priority than SQL configuration.
 // `compression` -> `orc.compression` -> `spark.sql.orc.compression.codec`
diff --git 
a/sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcHadoopFsRelationSuite.scala
 
b/sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcHadoopFsRelationSuite.scala
index aa2f110ceac..071035853b6 100644
--- 
a/sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcHadoopFsRelationSuite.scala
+++ 
b/sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcHadoopFsRelationSuite.scala
@@ -107,16 +107,6 @@ class OrcHadoopFsRelationSuite extends 
HadoopFsRelationTest {
   checkAnswer(df, copyDf)
 }
   }
-
-  test("Default compression codec is snappy for ORC compression") {
-withTempPath { file =>
-  spark.range(0, 10).write
-.orc(file.getCanonicalPath)
-  val expectedCompressionKind =
-OrcFileOperator.getFileReader(file.getCanonicalPath).get.getCompression
-  assert(OrcCompressionCodec.SNAPPY.name() === 
expectedCompressionKind.name())
-}
-  }
 }
 
 class HiveOrcHadoopFsRelationSuite extends OrcHadoopFsRelationSuite {





(spark) branch master updated: [MINOR][DOCS] Add license header at docs/_plugins

2024-01-09 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new d613772765f [MINOR][DOCS] Add license header at docs/_plugins
d613772765f is described below

commit d613772765fcfb953435e8710279b21c0e87261a
Author: Hyukjin Kwon 
AuthorDate: Tue Jan 9 17:41:32 2024 -0800

[MINOR][DOCS] Add license header at docs/_plugins

### What changes were proposed in this pull request?

This PR adds license header to `docs/_plugins` files.

### Why are the changes needed?

To comply with the Apache License 2.0.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

Existing CI should verify it e.g., linter.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44647 from HyukjinKwon/minor-license.

Authored-by: Hyukjin Kwon 
Signed-off-by: Dongjoon Hyun 
---
 docs/_plugins/conditonal_includes.rb | 16 
 docs/_plugins/production_tag.rb  | 16 
 2 files changed, 32 insertions(+)

diff --git a/docs/_plugins/conditonal_includes.rb 
b/docs/_plugins/conditonal_includes.rb
index 39280cbe5be..7c03a224b34 100644
--- a/docs/_plugins/conditonal_includes.rb
+++ b/docs/_plugins/conditonal_includes.rb
@@ -1,3 +1,19 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
 module Jekyll
   # Tag for including a file if it exists.
   class IncludeRelativeIfExistsTag < Tags::IncludeRelativeTag
diff --git a/docs/_plugins/production_tag.rb b/docs/_plugins/production_tag.rb
index 9f870cf2137..de860cf22ef 100644
--- a/docs/_plugins/production_tag.rb
+++ b/docs/_plugins/production_tag.rb
@@ -1,3 +1,19 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
 module Jekyll
   class ProductionTag < Liquid::Block
 





(spark) branch branch-3.5 updated: [SPARK-46637][DOCS] Enhancing the Visual Appeal of Spark doc website

2024-01-09 Thread gengliang
This is an automated email from the ASF dual-hosted git repository.

gengliang pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new d3e30848084 [SPARK-46637][DOCS] Enhancing the Visual Appeal of Spark 
doc website
d3e30848084 is described below

commit d3e3084808453769ba0cd4278ee8650e40c185ea
Author: Gengliang Wang 
AuthorDate: Wed Jan 10 09:32:30 2024 +0900

[SPARK-46637][DOCS] Enhancing the Visual Appeal of Spark doc website

### What changes were proposed in this pull request?

Enhance the visual appeal of the Spark doc website after
https://github.com/apache/spark/pull/40269:

 1. There is a weird indent on the top right side of the first paragraph of the Spark 3.5.0 doc overview page.
Before this PR: https://github.com/apache/spark/assets/1097932/84d21ca1-a4d0-4bd4-8f20-a34fa5db4000
After this PR: https://github.com/apache/spark/assets/1097932/4ffc0d5a-ed75-44c5-b20a-475ff401afa8

 2. All the titles are too big and therefore less readable. On https://spark.apache.org/downloads.html titles are h2, while on the doc site https://spark.apache.org/docs/latest/ titles are h1, so we should make the font size of titles smaller.
Before this PR: https://github.com/apache/spark/assets/1097932/5bbbd9eb-432a-42c0-98be-ff00a9099cd6
After this PR: https://github.com/apache/spark/assets/1097932/dc94c1fb-6ac1-41a8-b4a4-19b3034125d7

 3. The banner image can't be displayed correctly, and even when it shows up it is overlapped by the text. To keep things simple, let's not show the banner image, as in https://spark.apache.org/docs/3.4.2/.
Screenshots: https://github.com/apache/spark/assets/1097932/f6d34261-a352-44e2-9633-6e96b311a0b3
https://github.com/apache/spark/assets/1097932/c49ce6b6-13d9-4d8f-97a9-7ed8b037be57

### Why are the changes needed?

Improve the Visual Appeal of Spark doc website

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Manually build doc and verify on local setup.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #44642 from gengliangwang/enhance_doc.

Authored-by: Gengliang Wang 
Signed-off-by: Hyukjin Kwon 
---
 docs/_layouts/global.html  |  26 +++---
 docs/css/custom.css|  35 ++-
 docs/img/spark-hero-thin-light.jpg | Bin 278664 -> 0 bytes
 3 files changed, 25 insertions(+), 36 deletions(-)

diff --git a/docs/_layouts/global.html b/docs/_layouts/global.html
index 8c4435fdf31..5116472eaa7 100755
--- a/docs/_layouts/global.html
+++ b/docs/_layouts/global.html
@@ -138,25 +138,21 @@
 
 {% if page.url == "/" %}
 
-
-
 
 
   Apache Spark - A Unified 
engine for large-scale data analytics
 
-
-  
-Apache Spark is a unified analytics engine for large-scale 
data processing.
-It provides high-level APIs in Java, Scala, Python and R,
-and an optimized engine that supports general execution 
graphs.
-It also supports a rich set of higher-level tools including
-Spark SQL for SQL 
and structured data processing,
-pandas API on Spark 
for pandas workloads,
-MLlib for machine learning,
-GraphX for 
graph processing,
- and Structured Streaming
- for incremental computation and stream processing.
-  
+
+  Apache Spark is a unified analytics engine for large-scale 
data processing.
+  It provides high-level APIs in Java, Scala, Python and R,
+  and an optimized engine that supports general execution 
graphs.
+  It also supports a rich set of higher-level tools including
+  Spark SQL for SQL 
and structured data processing,
+  pandas API on Spark 
for pandas workloads,
+  MLlib for machine learning,
+  GraphX for graph 
processing,
+   and Structured Streaming
+   for incremental computation and stream processing.
 
 
   
diff --git a/docs/css/custom.css b/docs/css/custom.css
index 1239c0ed440..8158938866c 100644
--- a/docs/css/custom.css
+++ b/docs/css/custom.css
@@ -95,18 +95,7 @@ section {
   border-color: transparent;
 }
 
-.hero-banner .bg {
-  background: url(/img/spark-hero-thin-light.jpg) no-repeat;
-  transform: translate(36%, 0%);
-  height: 475px;
-  top: 0;
-  position: absolute;
-  right: 

(spark) branch master updated: [SPARK-37039][PS] Fix `Series.astype` to work properly with missing value

2024-01-09 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 0a66be8ded3 [SPARK-37039][PS] Fix `Series.astype` to work properly 
with missing value
0a66be8ded3 is described below

commit 0a66be8ded379ca056e4b4d1e4354fe8916b04ba
Author: Haejoon Lee 
AuthorDate: Wed Jan 10 09:54:56 2024 +0900

[SPARK-37039][PS] Fix `Series.astype` to work properly with missing value

### What changes were proposed in this pull request?

This PR proposes to fix `Series.astype` to work properly with missing value.

### Why are the changes needed?

To follow the behavior of the latest Pandas.

### Does this PR introduce _any_ user-facing change?

Yes, the bug is fixed to follow the behavior of Pandas:

**Before**
```python
>>> psser = ps.Series([decimal.Decimal(1), decimal.Decimal(2), decimal.Decimal(np.nan)])
>>> psser.astype(bool)
0     True
1     True
2    False
dtype: bool
```

**After**
```python
>>> psser = ps.Series([decimal.Decimal(1), decimal.Decimal(2), decimal.Decimal(np.nan)])
>>> psser.astype(bool)
0    True
1    True
2    True
dtype: bool
```

### How was this patch tested?

Enable the existing UTs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44570 from itholic/SPARK-37039.

Authored-by: Haejoon Lee 
Signed-off-by: Hyukjin Kwon 
---
 python/pyspark/pandas/data_type_ops/base.py   | 5 -
 python/pyspark/pandas/tests/data_type_ops/test_as_type.py | 5 +
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/python/pyspark/pandas/data_type_ops/base.py 
b/python/pyspark/pandas/data_type_ops/base.py
index 5a4cd7a1eb0..2df40252965 100644
--- a/python/pyspark/pandas/data_type_ops/base.py
+++ b/python/pyspark/pandas/data_type_ops/base.py
@@ -150,7 +150,10 @@ def _as_bool_type(index_ops: IndexOpsLike, dtype: Dtype) 
-> IndexOpsLike:
 if isinstance(dtype, extension_dtypes):
 scol = index_ops.spark.column.cast(spark_type)
 else:
-        scol = F.when(index_ops.spark.column.isNull(), F.lit(False)).otherwise(
+        null_value = (
+            F.lit(True) if isinstance(index_ops.spark.data_type, DecimalType) else F.lit(False)
+        )
+        scol = F.when(index_ops.spark.column.isNull(), null_value).otherwise(
             index_ops.spark.column.cast(spark_type)
         )
 return index_ops._with_new_scol(
diff --git a/python/pyspark/pandas/tests/data_type_ops/test_as_type.py 
b/python/pyspark/pandas/tests/data_type_ops/test_as_type.py
index b27cbceac8f..379d055d585 100644
--- a/python/pyspark/pandas/tests/data_type_ops/test_as_type.py
+++ b/python/pyspark/pandas/tests/data_type_ops/test_as_type.py
@@ -55,10 +55,7 @@ class AsTypeTestsMixin:
 lambda: psser.astype(int_type),
 )
 
-        # TODO(SPARK-37039): the np.nan series.astype(bool) should be True
-        if not pser.hasnans:
-            self.assert_eq(pser.astype(bool), psser.astype(bool))
-
+        self.assert_eq(pser.astype(bool), psser.astype(bool))
 self.assert_eq(pser.astype(float), psser.astype(float))
 self.assert_eq(pser.astype(np.float32), psser.astype(np.float32))
 self.assert_eq(pser.astype(str), psser.astype(str))





(spark) branch master updated: [MINOR][INFRA] Ensure that docs build successfully with SKIP_API=1

2024-01-09 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new eb111901608 [MINOR][INFRA] Ensure that docs build successfully with 
SKIP_API=1
eb111901608 is described below

commit eb11190160888495f3abc5b802f44ed9e53e
Author: Nicholas Chammas 
AuthorDate: Wed Jan 10 09:48:39 2024 +0900

[MINOR][INFRA] Ensure that docs build successfully with SKIP_API=1

### What changes were proposed in this pull request?

This PR tweaks the docs build so that the general docs are first built with 
`SKIP_API=1` to ensure that the docs build works without any language being 
built beforehand.

### Why are the changes needed?

[Committers expect][1] docs to build with `SKIP_API=1` on a fresh checkout. 
Yet, our CI build does not ensure this. This PR corrects this gap.

[1]: https://github.com/apache/spark/pull/44393/files#r1444169083

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Via test commits against this PR.

[The build now fails][f] if the docs reference an include that has not been 
generated yet.

[f]: 
https://github.com/nchammas/spark/actions/runs/7450949388/job/20271048581#step:30:29

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44627 from nchammas/skip-api-docs-build.

Authored-by: Nicholas Chammas 
Signed-off-by: Hyukjin Kwon 
---
 .github/workflows/build_and_test.yml | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/.github/workflows/build_and_test.yml 
b/.github/workflows/build_and_test.yml
index a93a70e8616..012e7c8fe9e 100644
--- a/.github/workflows/build_and_test.yml
+++ b/.github/workflows/build_and_test.yml
@@ -761,6 +761,9 @@ jobs:
   run: ./dev/lint-r
 - name: Run documentation build
   run: |
+        # Build docs first with SKIP_API to ensure they are buildable without requiring any
+        # language docs to be built beforehand.
+        cd docs; SKIP_API=1 bundle exec jekyll build; cd ..
 if [ -f "./dev/is-changed.py" ]; then
   # Skip PySpark and SparkR docs while keeping Scala/Java/SQL docs
   pyspark_modules=`cd dev && python3.9 -c "import 
sparktestsupport.modules as m; print(','.join(m.name for m in m.all_modules if 
m.name.startswith('pyspark')))"`





(spark) branch master updated: [SPARK-46633][SQL] Fix Avro reader to handle zero-length blocks

2024-01-09 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 3a6b9adc21c [SPARK-46633][SQL] Fix Avro reader to handle zero-length 
blocks
3a6b9adc21c is described below

commit 3a6b9adc21c25b01746cc31f3b75fe061a63204c
Author: Ivan Sadikov 
AuthorDate: Wed Jan 10 09:47:14 2024 +0900

[SPARK-46633][SQL] Fix Avro reader to handle zero-length blocks

### What changes were proposed in this pull request?

This PR fixes a bug in Avro connector with regard to zero-length blocks. If 
a file contains one of these blocks, the Avro connector may return an incorrect 
number of records or even an empty DataFrame in some cases.

This was due to the way the `hasNextRow` check worked. The `hasNext` method in Avro loads the next block, so if the block is empty, it returns false and the Avro connector stops reading rows. However, we should continue checking subsequent blocks until the sync point instead.

### Why are the changes needed?

Fixes a correctness bug in Avro connector.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

I added a unit test and a generated sample file to verify the fix. Without the patch, reading such a file would return fewer records than the actual number, or even 0 (depending on the maxPartitionBytes config).
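
A hypothetical reproduction sketch in PySpark (the file path is a placeholder and the `spark-avro` package must be available):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# An Avro file containing zero-length blocks; before this fix the count could be
# lower than the real number of records (or even 0), depending on
# spark.sql.files.maxPartitionBytes.
df = spark.read.format("avro").load("/path/to/file_with_empty_blocks.avro")
print(df.count())
```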

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44635 from sadikovi/SPARK-46633.

Authored-by: Ivan Sadikov 
Signed-off-by: Hyukjin Kwon 
---
 .../main/scala/org/apache/spark/sql/avro/AvroUtils.scala|  9 ++---
 connector/avro/src/test/resources/empty_blocks.avro |  5 +
 .../test/scala/org/apache/spark/sql/avro/AvroSuite.scala| 13 +
 3 files changed, 24 insertions(+), 3 deletions(-)

diff --git 
a/connector/avro/src/main/scala/org/apache/spark/sql/avro/AvroUtils.scala 
b/connector/avro/src/main/scala/org/apache/spark/sql/avro/AvroUtils.scala
index 27a5b918fc9..25e6aec4d84 100644
--- a/connector/avro/src/main/scala/org/apache/spark/sql/avro/AvroUtils.scala
+++ b/connector/avro/src/main/scala/org/apache/spark/sql/avro/AvroUtils.scala
@@ -182,16 +182,19 @@ private[sql] object AvroUtils extends Logging {
 
     def hasNextRow: Boolean = {
       while (!completed && currentRow.isEmpty) {
-        val r = fileReader.hasNext && !fileReader.pastSync(stopPosition)
-        if (!r) {
+        if (fileReader.pastSync(stopPosition)) {
           fileReader.close()
           completed = true
           currentRow = None
-        } else {
+        } else if (fileReader.hasNext()) {
           val record = fileReader.next()
           // the row must be deserialized in hasNextRow, because AvroDeserializer#deserialize
           // potentially filters rows
           currentRow = deserializer.deserialize(record).asInstanceOf[Option[InternalRow]]
+        } else {
+          // In this case, `fileReader.hasNext()` returns false but we are not past sync point yet.
+          // This means empty blocks, we need to continue reading the file in case there are non
+          // empty blocks or we are past sync point.
         }
       }
       currentRow.isDefined
diff --git a/connector/avro/src/test/resources/empty_blocks.avro 
b/connector/avro/src/test/resources/empty_blocks.avro
new file mode 100644
index 000..85d96f4af71
--- /dev/null
+++ b/connector/avro/src/test/resources/empty_blocks.avro
@@ -0,0 +1,5 @@
+Obj�6decoder.shape.version.patch0,decoder.shape.commitidP32d3df6520fbab9f829c638602bfcaf57a36af3a2decoder.shape.fingerprint�18b0f3cb011ab5450a2eb866a48017b178f421c495cfea8e014a892a2e0af6c4$decoder.shape.usid
 ea88d6ea20b60173cmesg_shape.id118 
cmesg_shape.name,testAvroMessageAAA(cmesg_shape.bytesize21.cmesg_shape.fingerprint�5d0906d6748d14d2a4b65b5a18dc05ba39e2def91236c3d0d08c577afe38b0180decoder.software.version==1_weoiwasd2weroqw_asdmjkjsdf_2p1_gcc.file.path�/tmp/aa/bbb/c/split-s/5d0f6168_20231107134036/20231107161711.653/tmp_28dycwr-asd-ed-123-234-128-2..file.mode
+batch 
.file.st_ino(-4344181839388196375".file.st_size169201408$.file.st_mtime.1998-02-15
 16:26:43.000_col_15111.222.333.4xyz_idHe4ed94eb-bfbd-458e-85af-cf1a7245a254
hrr_idhrr-5d0f6168ofname20231107134036*.decode_timestamp01998-02-15 
16:28:02.264Z.file.path�s3://test-bucket-abcde/aa//c/ddd/eee/fff..hh".file.st_size40788641.location�s3://test-bucket-abcde/aa//c/ddd/eee/fff..hh2.record_end_timestamp01998-02-15
 

(spark) branch master updated: [SPARK-46437][DOCS] Add custom tags for conditional Jekyll includes

2024-01-09 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 66794197e71 [SPARK-46437][DOCS] Add custom tags for conditional Jekyll 
includes
66794197e71 is described below

commit 66794197e71fb743ff6ba678e3f5ccd520d1b1be
Author: Nicholas Chammas 
AuthorDate: Wed Jan 10 09:40:05 2024 +0900

[SPARK-46437][DOCS] Add custom tags for conditional Jekyll includes

### What changes were proposed in this pull request?

Add [custom Jekyll tags][custom] to enable us to conditionally include 
files in our documentation build in a more user-friendly manner. [This 
example][example] demonstrates how a custom tag can build on one of Jekyll's 
built-in tags.

[custom]: 
https://github.com/Shopify/liquid/wiki/Liquid-for-Programmers#create-your-own-tags
[example]: 
https://github.com/Shopify/liquid/issues/370#issuecomment-688782101

Without this change, files have to be included as follows:

```liquid
{% for static_file in site.static_files %}
{% if static_file.name == 'generated-agg-funcs-table.html' %}
{% include_relative generated-agg-funcs-table.html %}
{% break %}
{% endif %}
{% endfor %}
```

With this change, they can be included more intuitively in one of two ways:

```liquid
{% include_relative_if_exists generated-agg-funcs-table.html %}
{% include_api_gen generated-agg-funcs-table.html %}
```

`include_relative_if_exists` includes a file if it exists and substitutes 
an HTML comment if not. Use this tag when it's always OK for an include not to 
exist.

`include_api_gen` includes a file if it exists. If it doesn't, it tolerates 
the missing file only if one of the `SKIP_` flags is set. Otherwise it raises 
an error. Use this tag for includes that are generated for the language APIs. 
These files are required to generate complete documentation, but we tolerate 
their absence during development---i.e. when a skip flag is set.

`include_api_gen` will place a visible text placeholder in the document and 
post a warning to the console to indicate that missing API files are being 
tolerated.

```sh
$ SKIP_API=1 bundle exec jekyll build
Configuration file: /Users/nchammas/dev/nchammas/spark/docs/_config.yml
Source: /Users/nchammas/dev/nchammas/spark/docs
   Destination: /Users/nchammas/dev/nchammas/spark/docs/_site
 Incremental build: disabled. Enable with --incremental
  Generating...
Warning: Tolerating missing API files because the following skip flags are 
set: SKIP_API
done in 1.703 seconds.
 Auto-regeneration: disabled. Use --watch to enable.
```

This PR supersedes #44393.

### Why are the changes needed?

Jekyll does not have a succinct way to [check if a file exists][check], so 
the required directives to implement such functionality are very cumbersome.

We need the ability to do this so that we can [build the docs successfully 
with `SKIP_API=1`][build], since many includes reference files that are only 
generated when `SKIP_API` is _not_ set.

[check]: https://github.com/jekyll/jekyll/issues/7528
[build]: https://github.com/apache/spark/pull/44627

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manually building and reviewing the docs, both with and without 
`SKIP_API=1`.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44630 from nchammas/SPARK-46437-conditional-jekyll-include.

Authored-by: Nicholas Chammas 
Signed-off-by: Hyukjin Kwon 
---
 docs/_plugins/conditonal_includes.rb |  55 +++
 docs/sql-ref-functions-builtin.md| 180 ---
 docs/sql-ref-functions.md|   2 +-
 3 files changed, 96 insertions(+), 141 deletions(-)

diff --git a/docs/_plugins/conditonal_includes.rb 
b/docs/_plugins/conditonal_includes.rb
new file mode 100644
index 000..39280cbe5be
--- /dev/null
+++ b/docs/_plugins/conditonal_includes.rb
@@ -0,0 +1,55 @@
+module Jekyll
+  # Tag for including a file if it exists.
+  class IncludeRelativeIfExistsTag < Tags::IncludeRelativeTag
+def render(context)
+  super
+rescue IOError
+  ""
+end
+  end
+  
+  # Tag for including files generated as part of the various language APIs.
+  # If a SKIP_ flag is set, tolerate missing files. If not, raise an error.
+  class IncludeApiGenTag < Tags::IncludeRelativeTag
+@@displayed_warning = false
+
+def render(context)
+  super
+rescue IOError => e
+  skip_flags = [
+'SKIP_API',
+'SKIP_SCALADOC',
+'SKIP_PYTHONDOC',
+   

(spark) branch master updated: [MINOR][PYTHON][TESTS] Retry `test_map_in_pandas_with_column_vector`

2024-01-09 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new d2033282ecb [MINOR][PYTHON][TESTS] Retry 
`test_map_in_pandas_with_column_vector`
d2033282ecb is described below

commit d2033282ecb25d3ecc413205183b204a75a1
Author: Ruifeng Zheng 
AuthorDate: Wed Jan 10 09:38:54 2024 +0900

[MINOR][PYTHON][TESTS] Retry `test_map_in_pandas_with_column_vector`

### What changes were proposed in this pull request?
Retry `test_map_in_pandas_with_column_vector` and its parity test

### Why are the changes needed?
I am seeing this test and its parity test fail from time to time, which then fails the `pyspark-sql` and `pyspark-connect` jobs.

It seems to be due to a log4j issue, e.g.:


https://github.com/zhengruifeng/spark/actions/runs/7459243602/job/20294868487
```
test_map_in_pandas_with_column_vector 
(pyspark.sql.tests.pandas.test_pandas_map.MapInPandasTests) ... ERROR 
StatusConsoleListener An exception occurred processing Appender File
 java.lang.IllegalArgumentException: found 1 argument placeholders, but 
provided 0 for pattern `0, VisitedIndex{visitedIndexes={}}: [] r:0`
at 
org.apache.logging.log4j.message.ParameterFormatter.formatMessage(ParameterFormatter.java:233)
```

https://github.com/apache/spark/actions/runs/7460093200/job/20297508703
```
  test_map_in_pandas_with_column_vector 
(pyspark.sql.tests.connect.test_parity_pandas_map.MapInPandasParityTests) ... 
ERROR StatusConsoleListener An exception occurred processing Appender File
 java.lang.IllegalArgumentException: found 1 argument placeholders, but 
provided 0 for pattern `0, VisitedIndex{visitedIndexes={}}: [] r:0`
at 
org.apache.logging.log4j.message.ParameterFormatter.formatMessage(ParameterFormatter.java:233)
at 
org.apache.logging.log4j.message.ParameterizedMessage.formatTo(ParameterizedMessage.java:266)
at
```

This PR simply retries the test after such failures.
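
For readers unfamiliar with the helper, the block below is a rough, self-contained sketch of what a retry decorator such as `eventually(timeout=180, catch_assertions=True)` does conceptually. It is not PySpark's actual implementation, the name `eventually_sketch` is made up, and the real helper's exact semantics (which exceptions it retries, how it backs off) may differ.

```python
import time

def eventually_sketch(timeout=180, catch_assertions=True):
    # Decorator factory: retry the wrapped test until it passes or the
    # timeout (in seconds) expires.
    def decorator(test_func):
        def wrapper(*args, **kwargs):
            deadline = time.monotonic() + timeout
            while True:
                try:
                    return test_func(*args, **kwargs)
                except AssertionError:
                    # Give up once the deadline passes, or immediately if
                    # assertion errors are not meant to be retried.
                    if not catch_assertions or time.monotonic() >= deadline:
                        raise
                    time.sleep(1)  # brief pause before the next attempt
        return wrapper
    return decorator
```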

### Does this PR introduce _any_ user-facing change?
no, test-only

### How was this patch tested?
ci

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #44641 from zhengruifeng/py_test_retry_mip.

Authored-by: Ruifeng Zheng 
Signed-off-by: Hyukjin Kwon 
---
 python/pyspark/sql/tests/pandas/test_pandas_map.py | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/python/pyspark/sql/tests/pandas/test_pandas_map.py 
b/python/pyspark/sql/tests/pandas/test_pandas_map.py
index ec9f208d08f..8f7229e1d74 100644
--- a/python/pyspark/sql/tests/pandas/test_pandas_map.py
+++ b/python/pyspark/sql/tests/pandas/test_pandas_map.py
@@ -31,7 +31,7 @@ from pyspark.testing.sqlutils import (
 pandas_requirement_message,
 pyarrow_requirement_message,
 )
-from pyspark.testing.utils import QuietTest
+from pyspark.testing.utils import QuietTest, eventually
 
 if have_pandas:
 import pandas as pd
@@ -381,6 +381,7 @@ class MapInPandasTestsMixin:
 self.assertEqual(sorted(actual), sorted(expected))
 
 # SPARK-33277
+@eventually(timeout=180, catch_assertions=True)
 def test_map_in_pandas_with_column_vector(self):
 path = tempfile.mkdtemp()
 shutil.rmtree(path)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated: [SPARK-46593][PS][TESTS] Refactor `data_type_ops` tests again

2024-01-09 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 2cf07f9b459 [SPARK-46593][PS][TESTS] Refactor `data_type_ops` tests 
again
2cf07f9b459 is described below

commit 2cf07f9b4595fa2e2a31fad913244718e9646dcb
Author: Ruifeng Zheng 
AuthorDate: Wed Jan 10 09:38:07 2024 +0900

[SPARK-46593][PS][TESTS] Refactor `data_type_ops` tests again

### What changes were proposed in this pull request?
Refactor the `data_type_ops` tests again (the previous PR, https://github.com/apache/spark/pull/44592, was reverted).

### Why are the changes needed?
make `OpsTestBase` reusable and reuse it in the parity tests
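
For context, here is a sketch of what a Connect parity test looks like after this refactor. The class and module names are taken from the diff below; that `OpsTestBase` itself now supplies the `pdf`/`psdf` fixtures is an assumption based on the removed per-class override.

```python
from pyspark.pandas.tests.data_type_ops.test_as_type import AsTypeTestsMixin
from pyspark.pandas.tests.data_type_ops.testing_utils import OpsTestBase
from pyspark.testing.pandasutils import PandasOnSparkTestUtils
from pyspark.testing.connectutils import ReusedConnectTestCase


class AsTypeParityTests(
    AsTypeTestsMixin,        # shared test cases
    PandasOnSparkTestUtils,  # common pandas-on-Spark helpers
    OpsTestBase,             # reusable base, assumed to provide the test fixtures
    ReusedConnectTestCase,   # runs the same cases against Spark Connect
):
    pass
```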

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
ci

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #44637 from zhengruifeng/ps_test_rere_data_type_ops_again.

Authored-by: Ruifeng Zheng 
Signed-off-by: Hyukjin Kwon 
---
 .../connect/data_type_ops/test_parity_as_type.py   |  12 +-
 .../connect/data_type_ops/test_parity_base.py  |   5 +-
 .../data_type_ops/test_parity_binary_ops.py|   7 +-
 .../data_type_ops/test_parity_boolean_ops.py   |  12 +-
 .../data_type_ops/test_parity_categorical_ops.py   |  12 +-
 .../data_type_ops/test_parity_complex_ops.py   |   7 +-
 .../connect/data_type_ops/test_parity_date_ops.py  |  12 +-
 .../data_type_ops/test_parity_datetime_ops.py  |  12 +-
 .../connect/data_type_ops/test_parity_null_ops.py  |   7 +-
 .../data_type_ops/test_parity_num_arithmetic.py|  12 +-
 .../connect/data_type_ops/test_parity_num_ops.py   |  12 +-
 .../data_type_ops/test_parity_num_reverse.py   |  12 +-
 .../data_type_ops/test_parity_string_ops.py|  12 +-
 .../data_type_ops/test_parity_timedelta_ops.py |  12 +-
 .../connect/data_type_ops/test_parity_udt_ops.py   |   7 +-
 .../tests/connect/data_type_ops/testing_utils.py   | 211 -
 .../pandas/tests/data_type_ops/test_as_type.py |   7 +-
 .../pandas/tests/data_type_ops/test_base.py|   5 +-
 .../pandas/tests/data_type_ops/test_binary_ops.py  |   7 +-
 .../pandas/tests/data_type_ops/test_boolean_ops.py |   7 +-
 .../tests/data_type_ops/test_categorical_ops.py|   7 +-
 .../pandas/tests/data_type_ops/test_complex_ops.py |   7 +-
 .../pandas/tests/data_type_ops/test_date_ops.py|   8 +-
 .../tests/data_type_ops/test_datetime_ops.py   |   7 +-
 .../pandas/tests/data_type_ops/test_null_ops.py|   7 +-
 .../tests/data_type_ops/test_num_arithmetic.py |   7 +-
 .../pandas/tests/data_type_ops/test_num_ops.py |   7 +-
 .../pandas/tests/data_type_ops/test_num_reverse.py |   7 +-
 .../pandas/tests/data_type_ops/test_string_ops.py  |   7 +-
 .../tests/data_type_ops/test_timedelta_ops.py  |   7 +-
 .../pandas/tests/data_type_ops/test_udt_ops.py |   7 +-
 .../pandas/tests/data_type_ops/testing_utils.py|   7 +-
 32 files changed, 177 insertions(+), 298 deletions(-)

diff --git 
a/python/pyspark/pandas/tests/connect/data_type_ops/test_parity_as_type.py 
b/python/pyspark/pandas/tests/connect/data_type_ops/test_parity_as_type.py
index a2a9e28a5ab..205b937fb51 100644
--- a/python/pyspark/pandas/tests/connect/data_type_ops/test_parity_as_type.py
+++ b/python/pyspark/pandas/tests/connect/data_type_ops/test_parity_as_type.py
@@ -16,19 +16,19 @@
 #
 import unittest
 
-from pyspark import pandas as ps
 from pyspark.pandas.tests.data_type_ops.test_as_type import AsTypeTestsMixin
-from pyspark.pandas.tests.connect.data_type_ops.testing_utils import 
OpsTestBase
+from pyspark.pandas.tests.data_type_ops.testing_utils import OpsTestBase
 from pyspark.testing.pandasutils import PandasOnSparkTestUtils
 from pyspark.testing.connectutils import ReusedConnectTestCase
 
 
 class AsTypeParityTests(
-AsTypeTestsMixin, PandasOnSparkTestUtils, OpsTestBase, 
ReusedConnectTestCase
+AsTypeTestsMixin,
+PandasOnSparkTestUtils,
+OpsTestBase,
+ReusedConnectTestCase,
 ):
-@property
-def psdf(self):
-return ps.from_pandas(self.pdf)
+pass
 
 
 if __name__ == "__main__":
diff --git 
a/python/pyspark/pandas/tests/connect/data_type_ops/test_parity_base.py 
b/python/pyspark/pandas/tests/connect/data_type_ops/test_parity_base.py
index c277f5ce066..1623db58af3 100644
--- a/python/pyspark/pandas/tests/connect/data_type_ops/test_parity_base.py
+++ b/python/pyspark/pandas/tests/connect/data_type_ops/test_parity_base.py
@@ -20,7 +20,10 @@ from pyspark.pandas.tests.data_type_ops.test_base import 
BaseTestsMixin
 from pyspark.testing.connectutils import ReusedConnectTestCase
 
 
-class BaseParityTests(BaseTestsMixin, ReusedConnectTestCase):
+class BaseParityTests(
+BaseTestsMixin,
+ReusedConnectTestCase,
+):
 pass

(spark) branch master updated: [SPARK-46637][DOCS] Enhancing the Visual Appeal of Spark doc website

2024-01-09 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 71468ebcc85e [SPARK-46637][DOCS] Enhancing the Visual Appeal of Spark 
doc website
71468ebcc85e is described below

commit 71468ebcc85e2694935086dcf0b01bfe2bff745f
Author: Gengliang Wang 
AuthorDate: Wed Jan 10 09:32:30 2024 +0900

[SPARK-46637][DOCS] Enhancing the Visual Appeal of Spark doc website

### What changes were proposed in this pull request?

Enhance the Visual Appeal of Spark doc website after 
https://github.com/apache/spark/pull/40269:
 1. There is a weird indent on the top right side of the first paragraph of the Spark 3.5.0 doc overview page.
Before this PR (screenshot): https://github.com/apache/spark/assets/1097932/84d21ca1-a4d0-4bd4-8f20-a34fa5db4000

After this PR (screenshot): https://github.com/apache/spark/assets/1097932/4ffc0d5a-ed75-44c5-b20a-475ff401afa8

 2. All the titles are too big and therefore less readable. In the 
website https://spark.apache.org/downloads.html, titles are h2 while in doc 
site https://spark.apache.org/docs/latest/ titles are h1. So we should make the 
font size of titles smaller.
Before this PR (screenshot): https://github.com/apache/spark/assets/1097932/5bbbd9eb-432a-42c0-98be-ff00a9099cd6
After this PR (screenshot): https://github.com/apache/spark/assets/1097932/dc94c1fb-6ac1-41a8-b4a4-19b3034125d7

 3. The banner image can't be displayed correctly, and even when it shows up it is overlapped by the text. To keep things simple, let's not show the banner image, as in https://spark.apache.org/docs/3.4.2/
Screenshots: https://github.com/apache/spark/assets/1097932/f6d34261-a352-44e2-9633-6e96b311a0b3
https://github.com/apache/spark/assets/1097932/c49ce6b6-13d9-4d8f-97a9-7ed8b037be57

### Why are the changes needed?

Improve the Visual Appeal of Spark doc website

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Manually build doc and verify on local setup.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #44642 from gengliangwang/enhance_doc.

Authored-by: Gengliang Wang 
Signed-off-by: Hyukjin Kwon 
---
 docs/_layouts/global.html  |  26 +++---
 docs/css/custom.css|  35 ++-
 docs/img/spark-hero-thin-light.jpg | Bin 278664 -> 0 bytes
 3 files changed, 25 insertions(+), 36 deletions(-)

diff --git a/docs/_layouts/global.html b/docs/_layouts/global.html
index 03f66acb12d8..6acffe8a405d 100755
--- a/docs/_layouts/global.html
+++ b/docs/_layouts/global.html
@@ -137,25 +137,21 @@
 
 {% if page.url == "/" %}
 
-
-
 
 
   Apache Spark - A Unified 
engine for large-scale data analytics
 
-
-  
-Apache Spark is a unified analytics engine for large-scale 
data processing.
-It provides high-level APIs in Java, Scala, Python and R,
-and an optimized engine that supports general execution 
graphs.
-It also supports a rich set of higher-level tools including
-Spark SQL for SQL 
and structured data processing,
-pandas API on Spark 
for pandas workloads,
-MLlib for machine learning,
-GraphX for 
graph processing,
- and Structured Streaming
- for incremental computation and stream processing.
-  
+
+  Apache Spark is a unified analytics engine for large-scale 
data processing.
+  It provides high-level APIs in Java, Scala, Python and R,
+  and an optimized engine that supports general execution 
graphs.
+  It also supports a rich set of higher-level tools including
+  Spark SQL for SQL 
and structured data processing,
+  pandas API on Spark 
for pandas workloads,
+  MLlib for machine learning,
+  GraphX for graph 
processing,
+   and Structured Streaming
+   for incremental computation and stream processing.
 
 
   
diff --git a/docs/css/custom.css b/docs/css/custom.css
index e80ca506a74c..51e89066e4d5 100644
--- a/docs/css/custom.css
+++ b/docs/css/custom.css
@@ -96,18 +96,7 @@ section {
   border-color: transparent;
 }
 
-.hero-banner .bg {
-  background: url(/img/spark-hero-thin-light.jpg) no-repeat;
-  transform: translate(36%, 0%);
-  height: 475px;
-  top: 0;
-  position: absolute;
-  right: 0;

(spark) branch master updated: [SPARK-46630][SQL] XML: Validate XML element name on write

2024-01-09 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new f76502202aca [SPARK-46630][SQL] XML: Validate XML element name on write
f76502202aca is described below

commit f76502202aca511cc64136bf664ee4ab0c58d666
Author: Sandip Agarwala <131817656+sandip...@users.noreply.github.com>
AuthorDate: Wed Jan 10 09:11:53 2024 +0900

[SPARK-46630][SQL] XML: Validate XML element name on write

### What changes were proposed in this pull request?
Validate XML element names on write. Spark SQL permits spaces in field names, and field names may even start with a number or special characters. Such field names cannot be converted to XML element names. This PR adds validation that throws an error on non-compliant XML element names.
This applies only to XML writes. Validation is on by default, but users can choose to disable it.
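
As a hedged illustration (the data, column names, and output paths below are made up, and the call assumes the `DataFrameWriter.xml(path, mode=..., validateName=...)` signature touched in the diff below), a column name containing a space trips the new check on write unless validation is disabled:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").getOrCreate()

# "item id" is a legal Spark SQL field name but not a legal XML element name.
df = spark.createDataFrame([(1, "widget")], ["item id", "name"])

# With validateName left at its default of true, this write is expected to fail:
# df.write.xml("/tmp/xml_out", mode="overwrite")

# Disabling element-name validation lets the write proceed.
df.write.xml("/tmp/xml_out_unvalidated", mode="overwrite", validateName=False)
```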

### Why are the changes needed?
Same as above

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
New unit test

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #44634 from sandip-db/SPARK-46630-xml-validate-element-name.

Lead-authored-by: Sandip Agarwala 
<131817656+sandip...@users.noreply.github.com>
Co-authored-by: Hyukjin Kwon 
Signed-off-by: Hyukjin Kwon 
---
 docs/sql-data-sources-xml.md   | 11 -
 python/pyspark/sql/connect/readwriter.py   |  2 +
 python/pyspark/sql/readwriter.py   |  2 +
 .../spark/sql/catalyst/xml/StaxXmlGenerator.scala  |  1 +
 .../apache/spark/sql/catalyst/xml/XmlOptions.scala |  2 +
 .../sql/execution/datasources/xml/XmlSuite.scala   | 54 +-
 6 files changed, 69 insertions(+), 3 deletions(-)

diff --git a/docs/sql-data-sources-xml.md b/docs/sql-data-sources-xml.md
index b10e054634ed..3b735191fc42 100644
--- a/docs/sql-data-sources-xml.md
+++ b/docs/sql-data-sources-xml.md
@@ -94,7 +94,7 @@ Data source options of XML can be set via:
   
   inferSchema
   true
-  If true, attempts to infer an appropriate type for each resulting 
DataFrame column. If false, all resulting columns are of string type. Default 
is true. XML built-in functions ignore this option.
+  If true, attempts to infer an appropriate type for each resulting 
DataFrame column. If false, all resulting columns are of string type.
   read
   
 
@@ -108,7 +108,7 @@ Data source options of XML can be set via:
   
 attributePrefix
 _
-The prefix for attributes to differentiate attributes from elements. 
This will be the prefix for field names. Default is _. Can be empty for reading 
XML, but not for writing.
+The prefix for attributes to differentiate attributes from elements. 
This will be the prefix for field names. Can be empty for reading XML, but not 
for writing.
 read/write
   
 
@@ -235,5 +235,12 @@ Data source options of XML can be set via:
 write
   
 
+  
+  validateName
+  true
+  If true, throws error on XML element name validation failure. For 
example, SQL field names can have spaces, but XML element names cannot.
+  write
+  
+
 
 Other generic options can be found in https://spark.apache.org/docs/latest/sql-data-sources-generic-options.html;>
 Generic File Source Options.
diff --git a/python/pyspark/sql/connect/readwriter.py 
b/python/pyspark/sql/connect/readwriter.py
index 52975917ea02..51698f262fc5 100644
--- a/python/pyspark/sql/connect/readwriter.py
+++ b/python/pyspark/sql/connect/readwriter.py
@@ -792,6 +792,7 @@ class DataFrameWriter(OptionUtils):
 timestampFormat: Optional[str] = None,
 compression: Optional[str] = None,
 encoding: Optional[str] = None,
+validateName: Optional[bool] = None,
 ) -> None:
 self.mode(mode)
 self._set_opts(
@@ -806,6 +807,7 @@ class DataFrameWriter(OptionUtils):
 timestampFormat=timestampFormat,
 compression=compression,
 encoding=encoding,
+validateName=validateName,
 )
 self.format("xml").save(path)
 
diff --git a/python/pyspark/sql/readwriter.py b/python/pyspark/sql/readwriter.py
index b61284247b0e..db9220fc48bb 100644
--- a/python/pyspark/sql/readwriter.py
+++ b/python/pyspark/sql/readwriter.py
@@ -2096,6 +2096,7 @@ class DataFrameWriter(OptionUtils):
 timestampFormat: Optional[str] = None,
 compression: Optional[str] = None,
 encoding: Optional[str] = None,
+validateName: Optional[bool] = None,
 ) -> None:
 r"""Saves the content of the :class:`DataFrame` in XML format at the 
specified path.
 
@@ -2155,6 +2156,7 @@ class DataFrameWriter(OptionUtils):
 timestampFormat=timestampFormat,
 

(spark) branch master updated: [SPARK-46634][SQL] literal validation should not drill down to null fields

2024-01-09 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 56dd1f7c101e [SPARK-46634][SQL] literal validation should not drill 
down to null fields
56dd1f7c101e is described below

commit 56dd1f7c101ed0db7a6fcb7ac2f6f06136ac3d37
Author: Wenchen Fan 
AuthorDate: Tue Jan 9 08:58:54 2024 -0800

[SPARK-46634][SQL] literal validation should not drill down to null fields

### What changes were proposed in this pull request?

This fixes a minor bug in literal validation. The contract of `InternalRow` is that callers should use `isNullAt` instead of relying on the `get` function to return null. `InternalRow` is an abstract class, and it is not guaranteed that `get` works for a null field. This PR fixes the literal validation to check `isNullAt` before reading the field value.

### Why are the changes needed?

Fix bugs for specific `InternalRow` implementations.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

new test

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #44640 from cloud-fan/literal.

Authored-by: Wenchen Fan 
Signed-off-by: Dongjoon Hyun 
---
 .../spark/sql/catalyst/expressions/literals.scala  |  4 +++-
 .../catalyst/expressions/LiteralExpressionSuite.scala  | 18 ++
 2 files changed, 21 insertions(+), 1 deletion(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/literals.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/literals.scala
index 79b2985adc1d..6c72afae91e9 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/literals.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/literals.scala
@@ -243,7 +243,9 @@ object Literal {
   v.isInstanceOf[InternalRow] && {
 val row = v.asInstanceOf[InternalRow]
 st.fields.map(_.dataType).zipWithIndex.forall {
-  case (fieldDataType, i) => doValidate(row.get(i, fieldDataType), 
fieldDataType)
+  case (fieldDataType, i) =>
+// Do not need to validate null values.
+row.isNullAt(i) || doValidate(row.get(i, fieldDataType), 
fieldDataType)
 }
   }
 case _ => false
diff --git 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/LiteralExpressionSuite.scala
 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/LiteralExpressionSuite.scala
index f63b60f5ebba..d42e0b7d681d 100644
--- 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/LiteralExpressionSuite.scala
+++ 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/LiteralExpressionSuite.scala
@@ -478,6 +478,24 @@ class LiteralExpressionSuite extends SparkFunSuite with 
ExpressionEvalHelper {
   UTF8String.fromString("Spark SQL"))
   }
 
+  // A generic internal row that throws exception when accessing null values
+  class NullAccessForbiddenGenericInternalRow(override val values: Array[Any])
+extends GenericInternalRow(values) {
+override def get(ordinal: Int, dataType: DataType): AnyRef = {
+  if (values(ordinal) == null) {
+throw new RuntimeException(s"Should not access null field at 
$ordinal!")
+  }
+  super.get(ordinal, dataType)
+}
+  }
+
+  test("SPARK-46634: literal validation should not drill down to null fields") 
{
+val exceptionInternalRow = new 
NullAccessForbiddenGenericInternalRow(Array(null, 1))
+val schema = StructType.fromDDL("id INT, age INT")
+// This should not fail because it should check whether the field is null 
before drilling down
+Literal.validateLiteralValue(exceptionInternalRow, schema)
+  }
+
   test("SPARK-46604: Literal support immutable ArraySeq") {
 import org.apache.spark.util.ArrayImplicits._
 val immArraySeq = Array(1.0, 4.0).toImmutableArraySeq


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated: [SPARK-46622][CORE] Override `toString` method for `o.a.s.network.shuffledb.StoreVersion`

2024-01-09 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new ef7846aa236e [SPARK-46622][CORE] Override `toString` method for 
`o.a.s.network.shuffledb.StoreVersion`
ef7846aa236e is described below

commit ef7846aa236e897e81964239c652c9ced2ff5a82
Author: yangjie01 
AuthorDate: Tue Jan 9 08:28:45 2024 -0800

[SPARK-46622][CORE] Override `toString` method for 
`o.a.s.network.shuffledb.StoreVersion`

### What changes were proposed in this pull request?
This PR aims to override the `toString` method of `o.a.s.network.shuffledb.StoreVersion`.

### Why are the changes needed?
Avoid displaying the default `StoreVersion@hashCode` representation in the `IOException` thrown after the checkVersion check fails in RocksDBProvider/LevelDBProvider, which shows something like:

```
cannot read state DB with version org.apache.spark.network.shuffledb.StoreVersion@1f, incompatible with current version org.apache.spark.network.shuffledb.StoreVersion@3e
```

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Add new test

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #44624 from LuciferYang/SPARK-46622.

Lead-authored-by: yangjie01 
Co-authored-by: YangJie 
Signed-off-by: Dongjoon Hyun 
---
 .../spark/network/shuffledb/StoreVersion.java  |  5 ++
 .../apache/spark/network/util/DBProviderSuite.java | 61 ++
 2 files changed, 66 insertions(+)

diff --git 
a/common/network-common/src/main/java/org/apache/spark/network/shuffledb/StoreVersion.java
 
b/common/network-common/src/main/java/org/apache/spark/network/shuffledb/StoreVersion.java
index c138163d21e1..e5887d353dd7 100644
--- 
a/common/network-common/src/main/java/org/apache/spark/network/shuffledb/StoreVersion.java
+++ 
b/common/network-common/src/main/java/org/apache/spark/network/shuffledb/StoreVersion.java
@@ -54,4 +54,9 @@ public class StoreVersion {
 result = 31 * result + minor;
 return result;
 }
+
+@Override
+public String toString() {
+  return "StoreVersion[" + major + "." + minor + ']';
+}
 }
diff --git 
a/common/network-common/src/test/java/org/apache/spark/network/util/DBProviderSuite.java
 
b/common/network-common/src/test/java/org/apache/spark/network/util/DBProviderSuite.java
new file mode 100644
index ..e258b9e6ff40
--- /dev/null
+++ 
b/common/network-common/src/test/java/org/apache/spark/network/util/DBProviderSuite.java
@@ -0,0 +1,61 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.network.util;
+
+import com.fasterxml.jackson.databind.ObjectMapper;
+import org.apache.commons.lang3.SystemUtils;
+import org.apache.spark.network.shuffledb.DBBackend;
+import org.apache.spark.network.shuffledb.StoreVersion;
+import org.junit.jupiter.api.Assertions;
+import org.junit.jupiter.api.Test;
+
+import java.io.File;
+import java.io.IOException;
+
+import static org.junit.jupiter.api.Assumptions.assumeFalse;
+
+public class DBProviderSuite {
+
+  @Test
+  public void testRockDBCheckVersionFailed() throws IOException {
+testCheckVersionFailed(DBBackend.ROCKSDB, "rocksdb");
+  }
+
+  @Test
+  public void testLevelDBCheckVersionFailed() throws IOException {
+assumeFalse(SystemUtils.IS_OS_MAC_OSX && 
SystemUtils.OS_ARCH.equals("aarch64"));
+testCheckVersionFailed(DBBackend.LEVELDB, "leveldb");
+  }
+
+  private void testCheckVersionFailed(DBBackend dbBackend, String namePrefix) 
throws IOException {
+String root = System.getProperty("java.io.tmpdir");
+File dbFile = JavaUtils.createDirectory(root, namePrefix);
+try {
+  StoreVersion v1 = new StoreVersion(1, 0);
+  ObjectMapper mapper = new ObjectMapper();
+  DBProvider.initDB(dbBackend, dbFile, v1, mapper).close();
+  StoreVersion v2 = new StoreVersion(2, 0);
+  IOException ioe = Assertions.assertThrows(IOException.class, () ->
+DBProvider.initDB(dbBackend, dbFile, v2, mapper));
+  

(spark) branch master updated: [SPARK-46331][SQL] Removing CodegenFallback from subset of DateTime expressions and version() expression

2024-01-09 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new ee1fd926802b [SPARK-46331][SQL] Removing CodegenFallback from subset 
of DateTime expressions and version() expression
ee1fd926802b is described below

commit ee1fd926802bb75f901ca72ab9b0c144c5eae035
Author: Aleksandar Tomic 
AuthorDate: Tue Jan 9 19:20:24 2024 +0800

[SPARK-46331][SQL] Removing CodegenFallback from subset of DateTime 
expressions and version() expression

### What changes were proposed in this pull request?

This PR moves us a bit closer to removing the CodegenFallback class by relying on RuntimeReplaceable with StaticInvoke instead of it.

This PR makes the following changes:
- Reworking the Spark `version()` expression to use StaticInvoke + RuntimeReplaceable (see the sketch after this list).
- Adding the Unevaluable trait to the affected DateTime expressions. These expressions need to be replaced during analysis anyway, so we explicitly forbid `eval` from being called.
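
As a quick, hedged illustration of the expression being reworked (the printed version and revision depend entirely on the build being run), `version()` returns the short Spark version followed by the git revision:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").getOrCreate()

# Prints something like "4.0.0 <git revision>"; exact values vary by build.
print(spark.sql("SELECT version()").first()[0])
```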

### Why are the changes needed?

The overall direction is to move away from CodegenFallback, and this PR moves us closer to that goal.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Running existing tests.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #44261 from dbatomic/codegenfallback_removal.

Lead-authored-by: Aleksandar Tomic 
Co-authored-by: Aleksandar Tomic 
<150942779+dbato...@users.noreply.github.com>
Signed-off-by: Wenchen Fan 
---
 .../catalyst/expressions/ExpressionImplUtils.java  | 13 ++
 .../catalyst/analysis/ResolveInlineTables.scala| 11 +++--
 .../sql/catalyst/analysis/ResolveTableSpec.scala   | 12 +-
 .../catalyst/expressions/datetimeExpressions.scala | 25 ++-
 .../spark/sql/catalyst/expressions/misc.scala  | 15 +++
 .../spark/sql/catalyst/util/DateTimeUtils.scala|  7 +--
 .../org/apache/spark/sql/internal/SQLConf.scala| 14 ++
 .../expressions/DateExpressionsSuite.scala | 50 +-
 .../optimizer/ComputeCurrentTimeSuite.scala| 14 ++
 .../catalyst/optimizer/EliminateSortsSuite.scala   |  4 +-
 .../optimizer/FoldablePropagationSuite.scala   | 24 +--
 .../sql/TableOptionsConstantFoldingSuite.scala |  4 +-
 12 files changed, 96 insertions(+), 97 deletions(-)

diff --git 
a/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/ExpressionImplUtils.java
 
b/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/ExpressionImplUtils.java
index b4fb9eae48da..8fe59cb7fae5 100644
--- 
a/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/ExpressionImplUtils.java
+++ 
b/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/ExpressionImplUtils.java
@@ -17,8 +17,10 @@
 
 package org.apache.spark.sql.catalyst.expressions;
 
+import org.apache.spark.SparkBuildInfo;
 import org.apache.spark.sql.errors.QueryExecutionErrors;
 import org.apache.spark.unsafe.types.UTF8String;
+import org.apache.spark.util.VersionUtils;
 
 import javax.crypto.Cipher;
 import javax.crypto.spec.GCMParameterSpec;
@@ -143,6 +145,17 @@ public class ExpressionImplUtils {
 );
   }
 
+  /**
+   * Function to return the Spark version.
+   * @return
+   *  Space separated version and revision.
+   */
+  public static UTF8String getSparkVersion() {
+String shortVersion = 
VersionUtils.shortVersion(SparkBuildInfo.spark_version());
+String revision = SparkBuildInfo.spark_revision();
+return UTF8String.fromString(shortVersion + " " + revision);
+  }
+
   private static SecretKeySpec getSecretKeySpec(byte[] key) {
 return switch (key.length) {
   case 16, 24, 32 -> new SecretKeySpec(key, 0, key.length, "AES");
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveInlineTables.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveInlineTables.scala
index 811e02b4d97b..3b9c6799bfaf 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveInlineTables.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveInlineTables.scala
@@ -68,17 +68,16 @@ object ResolveInlineTables extends Rule[LogicalPlan]
   /**
* Validates that all inline table data are valid expressions that can be 
evaluated
* (in this they must be foldable).
-   *
+   * Note that nondeterministic expressions are not supported since they are 
not foldable.
+   * Exception are CURRENT_LIKE expressions, which are replaced by a literal 
in later stages.
* This is package visible for unit testing.
*/
   private[analysis] def validateInputEvaluable(table: UnresolvedInlineTable): 
Unit = {
 

(spark) branch master updated (ee2a87b4642c -> 8fa794b13195)

2024-01-09 Thread sarutak
This is an automated email from the ASF dual-hosted git repository.

sarutak pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from ee2a87b4642c [SPARK-40876][SQL][TESTS][FOLLOW-UP] Remove invalid 
decimal test case when ANSI mode is on
 add 8fa794b13195 [SPARK-46627][SS][UI] Fix timeline tooltip content on 
streaming ui

No new revisions were added by this update.

Summary of changes:
 core/src/main/resources/org/apache/spark/ui/static/streaming-page.js | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org